Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers
Authors
Abstract
This paper describes a new method, COMBI-BOOTSTRAP, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. COMBI-BOOTSTRAP uses the output of existing resources as features for a second-level machine learning module that is trained, on a very small sample of annotated corpus material, to map to the new tagset. Experiments show that COMBI-BOOTSTRAP: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample.
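The second-level idea in the abstract can be illustrated with a minimal sketch. It assumes the outputs of the existing taggers are already available for each token, and it uses a simple frequency table with back-off as the second-level learner; that learner is a stand-in chosen for brevity, not the learning module used in the paper. The tag names, the `SAMPLE` data, and the functions `train_combiner` and `predict` are all hypothetical.

```python
from collections import Counter, defaultdict

def train_combiner(samples):
    """Train on (features, new_tag) pairs.

    features is a tuple holding the tag each existing tagger assigned
    to a token; new_tag is the hand-annotated tag in the new tagset.
    """
    full = defaultdict(Counter)    # whole feature tuple -> new-tag counts
    single = defaultdict(Counter)  # (tagger index, tag) -> new-tag counts
    overall = Counter()            # new-tag counts over the whole sample
    for features, new_tag in samples:
        full[tuple(features)][new_tag] += 1
        for i, tag in enumerate(features):
            single[(i, tag)][new_tag] += 1
        overall[new_tag] += 1
    return full, single, overall

def predict(model, features):
    """Map the existing taggers' outputs for one token to a new tag."""
    full, single, overall = model
    key = tuple(features)
    if key in full:
        # Exact combination seen in the small annotated sample.
        return full[key].most_common(1)[0][0]
    # Back off: pool the evidence from each tagger's output separately.
    votes = Counter()
    for i, tag in enumerate(features):
        votes.update(single.get((i, tag), {}))
    if votes:
        return votes.most_common(1)[0][0]
    # Nothing matched: fall back to the most frequent new tag.
    return overall.most_common(1)[0][0]

# Hypothetical sample: (tagger A output, tagger B output) -> new tag.
SAMPLE = [
    (("NN", "noun"), "N"),
    (("VB", "verb"), "V"),
    (("NN", "verb"), "N"),
    (("DT", "det"), "ART"),
    (("NN", "noun"), "N"),
]

model = train_combiner(SAMPLE)
print(predict(model, ("NN", "noun")))
print(predict(model, ("VB", "noun")))
```

The point of the sketch is that the component taggers never need to share a tagset with the target annotation: their outputs are treated as opaque feature values, and only the small hand-annotated sample ties them to the new tags.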
Similar resources
Analysis and Development of Urdu POS Tagged Corpus
In this paper, two corpora of Urdu (with 110K and 120K words) tagged with different POS tagsets are used to train TnT and Tree taggers. Error analysis of both taggers is done to identify frequent confusions in tagging. Based on the analysis of tagging, and syntactic structure of Urdu, a more refined tagset is derived. The existing tagged corpora are tagged with the new tagset to develop a singl...
Tagging the Past: Experiments using the Saga Corpus
There is an increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. This is carried out in three main steps. First, we evaluate taggers, which were trained on Modern Icelandic, when tagg...
Improving Tagging Performance by Using Voting Taggers
We present a bootstrapping method to develop an annotated corpus, which is especially useful for languages with few available resources. The method is being applied to develop a corpus of Spanish of over 5Mw. The method consists of taking advantage of the collaboration of two different POS taggers. The cases in which both taggers agree present a higher accuracy and are used to retrain the taggers.
Bootstrapping a Swedish Treebank Using Cross-Corpus Harmonization and Annotation Projection
In this paper, we describe an ongoing project with the aim of bootstrapping a large Swedish treebank, ultimately with a size of about 1.5 million tokens, by reusing two previously existing annotated corpora: an old treebank of about 350,000 tokens and a more recently developed part-of-speech-tagged corpus of about 1.2 million words. A key component in the bootstrapping methodology is the use of...
Part-of-Speech Tagging of Transcribed Speech
We used four Part-of-Speech taggers, which are available for research purposes and were originally trained on text to tag a corpus of transcribed multiparty spoken dialogues. The assigned tags were then manually corrected. The correction was first used to evaluate the four taggers, then to retrain them. Despite limited resources in time, money and annotators we reached results comparable to tho...
Journal title:
- CoRR
Volume cs.CL/0007018 Issue
Pages -
Publication date 2000